We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best-published FVD on three video generation benchmarks, including the challenging Kinetics-600. (ii) MAGVIT outperforms existing methods in inference time by two orders of magnitude against diffusion models and by 60x against autoregressive models. (iii) A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at https://magvit.cs.cmu.edu.
translated by Google Translate
Score-based modeling through stochastic differential equations (SDEs) has provided a new perspective on diffusion models, and demonstrated superior performance on continuous data. However, the gradient of the log-likelihood function, i.e., the score function, is not properly defined for discrete spaces. This makes it non-trivial to adapt score-based modeling to categorical data. In this paper, we extend diffusion models to discrete variables by introducing a stochastic jump process where the reverse process denoises via a continuous-time Markov chain. This formulation admits an analytical simulation during backward sampling. To learn the reverse process, we extend score matching to general categorical data and show that an unbiased estimator can be obtained via simple matching of the conditional marginal distributions. We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.
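Unsupervised domain adaptation is critical in various computer vision tasks, such as object detection and instance segmentation. It aims to reduce the performance degradation induced by domain bias while also speeding up model deployment. Previous works in domain adaptive object detection attempt to align image-level and instance-level shifts to minimize the domain discrepancy, but they may align single-class features with mixed-class features in image-level domain adaptation, because each image in an object detection task may contain more than one class and object. To obtain single-class-to-single-class and mixed-class-to-mixed-class alignment, we treat the mixed class of features as a new class and propose a mixed-classes $H$-divergence for object detection to achieve homogeneous feature alignment and reduce negative transfer. We then present a Semantic Consistency Feature Alignment Model (SCFAM) based on the mixed-classes $H$-divergence. To improve single-class and mixed-class semantic information and accomplish semantic separation, the SCFAM model proposes a Semantic Prediction Model (SPM) and a Semantic Bridging Component (SBC). The weight of the pix domain discriminator loss is then changed based on the SPM results to reduce sample imbalance. Extensive unsupervised domain adaptation experiments on widely used datasets illustrate the robust object detection of our proposed approach in domain bias settings.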
Existing deep learning-based traffic forecasting models are mainly trained with MSE (or MAE) as the loss function, which for simplicity assumes that residuals/errors follow an independent and isotropic Gaussian (or Laplacian) distribution. However, this assumption rarely holds for real-world traffic forecasting tasks, where the unexplained residuals are often correlated in both space and time. In this study, we propose Spatiotemporal Residual Regularization by modeling residuals with a dynamic (e.g., time-varying) mixture of zero-mean multivariate Gaussian distributions with learnable spatiotemporal covariance matrices. This approach allows us to directly capture spatiotemporally correlated residuals. For scalability, we model the spatiotemporal covariance for each mixture component using a Kronecker product structure, which significantly reduces the number of parameters and the computational complexity. We evaluate the performance of the proposed method on a traffic speed forecasting task. Our results show that, by properly modeling the residual distribution, the proposed method not only improves the model performance but also provides interpretable structures.
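To illustrate why a Kronecker product structure helps with scalability, the sketch below evaluates the negative log-likelihood of a single zero-mean Gaussian component whose covariance is the Kronecker product of a temporal and a spatial covariance, without ever forming the full matrix. It is a minimal sketch and not the authors' implementation: the function names, the single-component simplification (no mixture, no time-varying weights), and the column-major vectorization convention are all assumptions made here.

```python
import numpy as np

def kron_gaussian_nll(R, cov_s, cov_t):
    """NLL of a zero-mean residual matrix R (N locations x T steps)
    under vec(R) ~ N(0, cov_t kron cov_s), using only the small factors.

    Assumption: vec() stacks the columns of R (column-major order).
    """
    N, T = R.shape
    _, logdet_s = np.linalg.slogdet(cov_s)
    _, logdet_t = np.linalg.slogdet(cov_t)
    A = np.linalg.solve(cov_s, R)        # cov_s^{-1} R,   shape N x T
    B = np.linalg.solve(cov_t, R.T)      # cov_t^{-1} R^T, shape T x N
    quad = np.trace(B @ A)               # tr(cov_t^{-1} R^T cov_s^{-1} R)
    # logdet(cov_t kron cov_s) = N * logdet(cov_t) + T * logdet(cov_s)
    logdet = N * logdet_t + T * logdet_s
    return 0.5 * (N * T * np.log(2.0 * np.pi) + logdet + quad)

def dense_gaussian_nll(R, cov_s, cov_t):
    """Reference implementation that builds the full (N*T) x (N*T) covariance."""
    N, T = R.shape
    cov = np.kron(cov_t, cov_s)
    r = R.reshape(-1, order="F")         # column-major vec(R)
    _, logdet = np.linalg.slogdet(cov)
    quad = r @ np.linalg.solve(cov, r)
    return 0.5 * (N * T * np.log(2.0 * np.pi) + logdet + quad)
```

Under this factorization the covariance has on the order of N^2 + T^2 free parameters rather than (NT)^2, and the likelihood only requires solves with the N x N and T x T factors.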
Spatiotemporal traffic data imputation is of great significance in intelligent transportation systems and data-driven decision-making processes. To accurately reconstruct partially observed traffic data, we assert the importance of characterizing both global and local trends in traffic time series. In the literature, substantial prior work has demonstrated the effectiveness of exploiting the low-rank property of traffic data with matrix/tensor completion models. In this study, we first introduce a Laplacian kernel for temporal regularization to characterize local trends in traffic time series, which can be formulated as a circular convolution. Then, we develop a low-rank Laplacian convolutional representation (LCR) model by combining the nuclear norm of a circulant matrix with the Laplacian temporal regularization, which is proved to fit a unified framework that admits a fast Fourier transform solution with relatively low time complexity. Through extensive experiments on several traffic datasets, we demonstrate the superiority of LCR over existing baseline models for imputing traffic time series with various behaviors (e.g., data noise and strong/weak periodicity). The proposed LCR model is an efficient and effective solution to large-scale traffic data imputation. The adapted datasets and Python implementation are publicly available at https://github.com/xinychen/transdim.
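The connection between a Laplacian temporal kernel, circular convolution, and the fast Fourier transform can be sketched as follows. This is an illustrative example only: the kernel layout used here (a center weight of 2*tau with tau neighbors of weight -1 on each side, wrapping around the ends) and the function names are assumptions, and the code does not reproduce the LCR model or the transdim repository.

```python
import numpy as np

def laplacian_kernel(T, tau):
    """Length-T circular Laplacian kernel (assumed layout; requires T > 2 * tau)."""
    ell = np.zeros(T)
    ell[0] = 2 * tau          # center weight
    ell[1:tau + 1] = -1       # tau neighbors on one side
    ell[-tau:] = -1           # tau neighbors on the other side (wrap-around)
    return ell

def circ_conv_fft(ell, x):
    """Circular convolution via the convolution theorem, O(T log T)."""
    return np.real(np.fft.ifft(np.fft.fft(ell) * np.fft.fft(x)))

def circ_conv_direct(ell, x):
    """Direct O(T^2) circular convolution, kept only as a reference."""
    T = len(x)
    return np.array([sum(ell[(t - k) % T] * x[k] for k in range(T))
                     for t in range(T)])

def temporal_penalty(x, tau):
    """Squared norm of the Laplacian-filtered series; small for smooth series."""
    return float(np.sum(circ_conv_fft(laplacian_kernel(len(x), tau), x) ** 2))
```

Because the kernel weights sum to zero, a constant series incurs no penalty, while rapid local fluctuations are penalized.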
This paper studies a problem of broad practical interest in spatiotemporal data analysis: discovering interpretable dynamic patterns from spatiotemporal data. To this end, we develop a time-varying reduced-rank vector autoregression (VAR) model whose coefficient matrices are parameterized by low-rank tensor factorization. Benefiting from the tensor factorization structure, the proposed model can simultaneously achieve model compression and pattern discovery. In particular, the proposed model allows one to characterize nonstationarity and time-varying system behaviors underlying spatiotemporal data. To evaluate the proposed model, extensive experiments are conducted on various spatiotemporal data representing different nonlinear dynamical systems, including fluid dynamics, sea surface temperature, USA surface temperature, and NYC taxi trips. Experimental results demonstrate the effectiveness of modeling spatiotemporal data and characterizing spatial/temporal patterns with the proposed model. In the spatial context, the spatial patterns can be automatically extracted and intuitively characterized by the spatial modes. In the temporal context, the complex time-varying system behaviors can be revealed by the temporal modes in the proposed model. Thus, our model lays an insightful foundation for understanding complex spatiotemporal data in real-world dynamical systems. The adapted datasets and Python implementation are publicly available at https://github.com/xinychen/vars.
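Recent studies utilizing multi-modal data aim to build models for facial action unit (AU) detection. However, due to the heterogeneity of multi-modal data, multi-modal representation learning becomes one of the main challenges. On the one hand, it is difficult to extract relevant features from multiple modalities with only one feature extractor; on the other hand, previous studies have not fully explored the potential of multi-modal fusion strategies. For example, early fusion usually requires all modalities to be present during inference, while late fusion and middle fusion increase the network size for feature learning. In contrast to the large body of work on late fusion, few works on early fusion explore channel information. This paper presents a novel multi-modal network called Multi-modal Channel-Mixing (MCM), as a pre-trained model to learn a robust representation that facilitates multi-modal fusion. We evaluate the learned representation on a downstream task of automatic facial action unit detection. Specifically, it is a single-stream encoder network that uses a channel-mixing module in early fusion and requires only one modality in the downstream detection task. We also utilize a masked ViT encoder to learn features from the fused image and reconstruct the two modalities with two ViT decoders. We have conducted extensive experiments on two public datasets, BP4D and DISFA, to evaluate the effectiveness and robustness of the proposed multi-modal framework. The results show that our approach is comparable or superior to state-of-the-art baseline methods.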
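In this paper, we study the problem of recovering a low-rank matrix from a number of noisy random linear measurements. We consider the setting where the rank of the ground-truth matrix is unknown and use an overspecified factored representation of the matrix variable, where the global optimal solutions overfit and do not correspond to the underlying ground truth. We then solve the associated nonconvex problem using gradient descent with small random initialization. We show that as long as the measurement operators satisfy the restricted isometry property (RIP) with a rank parameter that scales with the rank of the ground-truth matrix, rather than with the overspecified matrix variable, the gradient descent iterates follow a particular trajectory towards the ground-truth matrix and achieve nearly information-theoretically optimal recovery when stopped appropriately. We then propose an efficient early stopping strategy based on the common hold-out method and show that it detects a nearly optimal estimator. Moreover, experiments show that the proposed validation approach can also be used efficiently for image restoration with deep image prior, which over-parameterizes an image with a deep network.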
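The procedure described above (overspecified factorization, gradient descent from a small random initialization, and hold-out validation to choose the stopping point) can be sketched as follows. This is a minimal illustration under assumptions made here, not the authors' code: the ground truth is taken to be symmetric positive semidefinite and parameterized as U U^T, the measurement matrices are supplied as a dense array, and the step size, iteration count, and initialization scale are arbitrary illustrative values.

```python
import numpy as np

def gd_with_holdout(A_tr, y_tr, A_val, y_val, k,
                    step=0.02, iters=1000, init_scale=1e-3, seed=0):
    """Gradient descent on f(U) = mean((<A_i, U U^T> - y_i)^2) / 2 with an
    overspecified factor U (n x k), returning the iterate with the lowest
    hold-out (validation) error.

    A_tr, A_val: arrays of shape (m, n, n); y_tr, y_val: arrays of shape (m,).
    """
    rng = np.random.default_rng(seed)
    n = A_tr.shape[1]
    U = init_scale * rng.standard_normal((n, k))    # small random initialization
    best_val, best_U, best_it = np.inf, U.copy(), 0
    for it in range(iters):
        # Residuals of the training measurements at the current iterate.
        res = np.einsum("mij,ij->m", A_tr, U @ U.T) - y_tr
        G = np.einsum("m,mij->ij", res, A_tr) / len(y_tr)
        U = U - step * (G + G.T) @ U                 # gradient of f with respect to U
        # Hold-out error decides which iterate to keep.
        val_res = np.einsum("mij,ij->m", A_val, U @ U.T) - y_val
        val = float(np.mean(val_res ** 2))
        if val < best_val:
            best_val, best_U, best_it = val, U.copy(), it + 1
    return best_U, best_it, best_val
```

For simplicity the loop runs for a fixed number of iterations and keeps the best validated iterate; a practical variant would break out of the loop once the hold-out error has stopped improving for some patience window.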
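Probabilistic modeling of multidimensional spatiotemporal data is critical to many real-world applications. However, real-world spatiotemporal data often exhibit complex dependencies that are nonstationary, i.e., the correlation structure varies with location/time, and nonseparable, i.e., dependencies exist between space and time. Developing effective and computationally efficient statistical models that accommodate nonstationary/nonseparable processes containing both long-range and short-scale variations becomes a challenging task, especially for large-scale datasets with various corruption/missing structures. In this paper, we propose a new statistical framework, Bayesian Complementary Kernelized Learning (BCKL), to achieve scalable probabilistic modeling for multidimensional spatiotemporal data. To effectively describe complex dependencies, BCKL integrates kernelized low-rank factorization with short-range spatiotemporal Gaussian processes (GP), in which the two components complement each other. Specifically, we use a multi-linear low-rank factorization component to capture the global/long-range correlations in the data and introduce an additive short-scale GP based on compactly supported kernel functions to characterize the remaining local variabilities. We develop an efficient Markov chain Monte Carlo (MCMC) algorithm for model inference and evaluate the proposed BCKL framework on both synthetic and real-world spatiotemporal datasets. Our results confirm the superior performance of BCKL in providing accurate posterior means and high-quality uncertainty estimates.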
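No human drives a car in a vacuum; she/he must negotiate with other road users to achieve their goals in social traffic scenes. A rational human driver can interact with other road users in a socially compatible way through implicit communication to complete their driving tasks smoothly in interaction-intensive, safety-critical environments. This paper aims to review the existing approaches and theories to help understand and rethink the interactions among human drivers toward social autonomous driving. We conduct this survey to seek answers to a series of fundamental questions: 1) What is social interaction in road traffic scenes? 2) How can social interaction be measured and evaluated? 3) How can the process of social interaction be modeled and revealed? 4) How do human drivers reach an implicit agreement and negotiate smoothly in social interaction? This paper reviews various approaches to modeling and learning the social interactions between human drivers, ranging from optimization theory and graphical models to social force theory and behavioral and cognitive science. We also highlight some new directions, critical challenges, and open questions for future research.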